22 research outputs found

    SLU FOR VOICE COMMAND IN SMART HOME: COMPARISON OF PIPELINE AND END-TO-END APPROACHES

    Get PDF
    Spoken Language Understanding (SLU) is typically performed through automatic speech recognition (ASR) and natural language understanding (NLU) in a pipeline. However, errors at the ASR stage have a negative impact on the NLU performance. Hence, there is a rising interest in End-to-End (E2E) SLU to jointly perform ASR and NLU. Although E2E models have shown superior performance to modular approaches in many NLP tasks, current E2E SLU models have still not definitively superseded pipeline approaches. In this paper, we present a comparison of the pipeline and E2E approaches for the task of voice command in smart homes. Since there are no large non-English domain-specific data sets available, although needed for an E2E model, we tackle the lack of such data by combining Natural Language Generation (NLG) and text-to-speech (TTS) to generate French training data. The trained models were evaluated on voice commands acquired in a real smart home with several speakers. Results show that the E2E approach can reach performance similar to a state-of-the-art pipeline SLU despite a higher WER than the pipeline approach. Furthermore, the E2E model can benefit from artificially generated data to exhibit lower Concept Error Rates than the pipeline baseline for slot recognition.
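    The Concept Error Rate (CER) reported above is the slot-level analogue of WER: the edit distance between the reference and hypothesised concept sequences, normalised by the reference length. A minimal sketch (the concept label strings below are invented for illustration, not the paper's tag set):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over label sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)]

def concept_error_rate(ref_concepts, hyp_concepts):
    # CER = (substitutions + insertions + deletions) / number of reference concepts.
    if not ref_concepts:
        return 0.0 if not hyp_concepts else 1.0
    return edit_distance(ref_concepts, hyp_concepts) / len(ref_concepts)
```

    Like WER, this can exceed 1.0 when the hypothesis contains many spurious insertions.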

    Towards End-to-End spoken intent recognition in smart home

    Get PDF
    Voice-based interaction in a smart home has become a feature of many industrial products. These systems react to voice commands, whether to answer a question, play music or turn on the lights. To be efficient, these systems must be able to extract the intent of the user from the voice command. Intent recognition from voice is typically performed through automatic speech recognition (ASR) followed by intent classification from the transcriptions in a pipeline. However, the errors accumulated at the ASR stage might severely impact the intent classifier. In this paper, we propose an End-to-End (E2E) model that performs intent classification directly from the raw speech input. The E2E approach is thus optimized for this specific task and avoids error propagation. Furthermore, prosodic aspects of the speech signal can be exploited by the E2E model for intent classification (e.g., question vs. imperative voice). Experiments on a corpus of voice commands acquired in a real smart home reveal that the state-of-the-art pipeline baseline is still superior to the E2E approach. However, using artificial data generation techniques, we show that significant improvements can be brought to the E2E model to reach competitive performance. This opens the way to further research on E2E Spoken Language Understanding.
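    The pipeline baseline's second stage, intent classification from (possibly ASR-noisy) transcriptions, can be sketched with a toy keyword-overlap scorer. The intent labels and keyword sets below are invented for illustration and are not the corpus's actual intent inventory:

```python
# Toy intent lexicon: each intent maps to a set of trigger keywords.
INTENT_KEYWORDS = {
    "turn_on_light": {"turn", "on", "light", "lamp"},
    "play_music": {"play", "music", "song"},
    "ask_question": {"what", "when", "where", "is"},
}

def classify_intent(transcription):
    # Score each intent by keyword overlap with the transcription tokens;
    # the highest-overlap intent wins (ties resolved by dictionary order).
    tokens = set(transcription.lower().split())
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    return max(scores, key=scores.get)
```

    A real pipeline NLU would use a trained classifier rather than a hand-written lexicon, but the failure mode is the same: an ASR error that drops a keyword can flip the predicted intent, which is the error propagation the E2E approach avoids.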

    The VocADom Project: Speech Interaction for Well-being and Reliance Improvement

    Get PDF
    The VocADom project aims to provide audio-based interaction technology that lets users have full control over their home environment and eases the social inclusion of the elderly and frail population. This paper presents an overview of the project, focusing on multimodal corpus acquisition and labelling and on the investigated techniques for speech enhancement and understanding.

    Contribution of end-to-end neural models to automatic spoken language understanding in the smart home

    No full text
    Smart speakers offer the possibility of interacting with smart home systems and make it possible to issue a range of requests about various subjects. They represent the first ambient voice interfaces commonly available in home environments, although they are often only capable of inferring voice commands with a simple syntax from short utterances. In smart homes designed to promote home care for senior adults, such systems support inhabitants in everyday situations, improving their quality of life and providing assistance in situations of distress. The design of these smart homes mainly focuses on the safety and comfort of their inhabitants, so research projects frequently concentrate on human activity detection, leaving the communicative aspects of smart home design with little attention. Consequently, there are insufficient speech corpora specific to the home automation field, in particular for languages other than English. However, the availability of such corpora is crucial for developing interactive communication systems between the smart home and its inhabitants, and they could also contribute to a generation of smart speakers capable of extracting more complex voice commands. Part of our work therefore consisted in developing a corpus generator that produces home-automation-specific voice commands automatically annotated with intent and concept labels. A Spoken Language Understanding (SLU) system must extract the intents and concepts from these commands to provide the decision-making module with the information necessary for their execution. Classically, the natural language understanding (NLU) module is preceded by an automatic speech recognition (ASR) module that converts speech into transcriptions. As several studies have shown, chaining ASR and NLU in a sequential SLU approach accumulates errors. Therefore, one of the main motivations of our work is the development of an end-to-end SLU module that extracts concepts and intents directly from speech. To achieve this goal, we first develop a sequential SLU approach as our baseline, in which a classic ASR method generates transcriptions that are passed to the NLU module, before continuing with the development of an end-to-end SLU module. Both SLU systems were evaluated on a corpus recorded in the home automation domain. We investigate whether the prosodic information that the end-to-end SLU system has access to contributes to SLU performance, and we also compare the robustness of the two approaches when facing speech with more semantic and syntactic variation. This work was carried out within the framework of the ANR VocADom project.
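    The corpus generator described above can be sketched as template expansion over slot values, with the intent and concept annotations attached automatically. The templates, slot names, and labels below are hypothetical, not the thesis's actual grammar:

```python
import itertools

# Toy templates: slot names appear in braces; each template carries an intent label.
TEMPLATES = [
    ("turn {device} {state}", "set_device"),
    ("{device} please", "set_device"),
]
SLOT_VALUES = {"device": ["the light", "the radio"], "state": ["on", "off"]}

def generate_corpus():
    # Expand every template with every combination of its slot values,
    # emitting (utterance, intent, concept-dict) triples.
    corpus = []
    for template, intent in TEMPLATES:
        slots = [s for s in SLOT_VALUES if "{" + s + "}" in template]
        for combo in itertools.product(*(SLOT_VALUES[s] for s in slots)):
            concepts = dict(zip(slots, combo))
            utterance = template.format(**concepts)
            corpus.append((utterance, intent, concepts))
    return corpus
```

    Because the annotations are produced during generation, no manual labelling pass is needed; feeding the utterances to a TTS system then yields paired speech and semantic labels for E2E training.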

    Contribution of end-to-end deep learning models for spoken language understanding in smart homes

    No full text
    Smart speakers offer the possibility of interacting with smart home systems and make it possible to issue a range of requests about various subjects. They represent the first ambient voice interfaces commonly available in home environments, although they are often only capable of inferring voice commands with a simple syntax from short utterances. In smart homes designed to promote home care for senior adults, such systems support inhabitants in everyday situations, improving their quality of life and providing assistance in situations of distress. The design of these smart homes mainly focuses on the safety and comfort of their inhabitants, so research projects frequently concentrate on human activity detection, leaving the communicative aspects of smart home design with little attention. Consequently, there are insufficient speech corpora specific to the home automation field, in particular for languages other than English. However, the availability of such corpora is crucial for developing interactive communication systems between the smart home and its inhabitants, and they could also contribute to a generation of smart speakers capable of extracting more complex voice commands. Part of our work therefore consisted in developing a corpus generator that produces home-automation-specific voice commands automatically annotated with intent and concept labels. A Spoken Language Understanding (SLU) system must extract the intents and concepts from these commands to provide the decision-making module with the information necessary for their execution. Classically, the natural language understanding (NLU) module is preceded by an automatic speech recognition (ASR) module that converts speech into transcriptions. As several studies have shown, chaining ASR and NLU in a sequential SLU approach accumulates errors. Therefore, one of the main motivations of our work is the development of an end-to-end SLU module that extracts concepts and intents directly from speech. To achieve this goal, we first develop a sequential SLU approach as our baseline, in which a classic ASR method generates transcriptions that are passed to the NLU module, before continuing with the development of an end-to-end SLU module. Both SLU systems were evaluated on a corpus recorded in the home automation domain. We investigate whether the prosodic information that the end-to-end SLU system has access to contributes to SLU performance, and we also compare the robustness of the two approaches when facing speech with more semantic and syntactic variation. This work was carried out within the framework of the ANR VocADom project.

    Corpus generation for voice command in smart home and the effect of speech synthesis on End-to-End SLU

    No full text
    Massive amounts of annotated data have greatly contributed to the advance of the machine learning field. However, such large data sets are often unavailable for novel tasks performed in realistic environments such as smart homes. In this domain, semantically annotated large voice command corpora for Spoken Language Understanding (SLU) are scarce, especially for non-English languages. We present the automatic generation process of a synthetic, semantically annotated corpus of French commands for smart homes, used to train pipeline and End-to-End (E2E) SLU models. SLU is typically performed through Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) in a pipeline. Since errors at the ASR stage reduce the NLU performance, an alternative is E2E SLU, which jointly performs ASR and NLU. To train such models, the artificial corpus was fed to a text-to-speech (TTS) system to generate synthetic speech data. All models were evaluated on voice commands acquired in a real smart home. We show that artificial data can be combined with real data within the same training set or used as a stand-alone training corpus. The synthetic speech quality was assessed by comparing it to real data using dynamic time warping (DTW).
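    Dynamic time warping, used above to compare synthetic and real speech, finds the minimum-cost alignment between two sequences of different lengths. A minimal sketch; real use would align frame-level acoustic feature vectors (e.g., MFCCs) rather than the scalar sequences used here:

```python
def dtw_distance(seq_a, seq_b):
    # Dynamic time warping: minimum cumulative alignment cost between two
    # sequences, allowing frames to be stretched or compressed in time.
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            # Each cell extends the cheapest of: skip a frame in either
            # sequence, or advance in both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

    A low DTW distance between TTS output and a real recording of the same command suggests the synthetic speech follows a similar acoustic trajectory, which is one way to sanity-check synthetic training data.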

    Event prominence extraction combining a knowledge-based syntactic parser and a BERT classifier for Dutch

    No full text
    A core task in information extraction is event detection, which identifies event triggers in sentences that are typically classified into event types. In this study, an event is considered the unit for measuring diversity and similarity in news articles within the framework of a news recommendation system. Current typology-based event detection approaches fail to handle the variety of events expressed in real-world situations. To overcome this, we aim to perform event salience classification and explore whether a transformer model is capable of classifying new information into less and more general prominence classes. After comparing the performance of a Support Vector Machine (SVM) baseline and our transformer-based classifier on several event span formats, we conceived multi-word event spans as syntactic clauses. These are fed into our prominence classifier, which is fine-tuned on pre-trained Dutch BERT word embeddings. Moreover, we outperform a pipeline combining a Conditional Random Field (CRF) approach to event-trigger word detection with the BERT-based classifier. To the best of our knowledge, we present the first event extraction approach that combines an expert-based syntactic parser with a transformer-based classifier for Dutch.
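    The clause-level prominence classification can be illustrated with a toy bag-of-words nearest-centroid stand-in. The actual systems are an SVM baseline and a fine-tuned Dutch BERT model; the labels and training examples below are invented:

```python
from collections import Counter

def train_centroids(examples):
    # examples: list of (clause, prominence_label) pairs; build a
    # bag-of-words centroid (token count profile) per prominence class.
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(text.lower().split())
    return centroids

def classify(clause, centroids):
    # Score each class by how strongly the clause's tokens appear in its
    # centroid; the highest-scoring class wins (ties by dictionary order).
    tokens = clause.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in centroids.items()}
    return max(scores, key=scores.get)
```

    The point of the sketch is the interface, not the model: whole clauses, rather than single trigger words, are the classified unit, which is the span format the study found most effective.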
